Journal of Cheminformatics — Latest Matching Preprints

1

BBB-Nuke: Transport-Aware Prediction of Blood-Brain Barrier Penetration in Small Molecules

Abasciano, N.; Hadipour, H.; Poddar, A.; Rudrum, J.; Sobodu, T.

2026-07-14 bioengineering 10.64898/2026.07.13.738280 medRxiv

Top 0.1%

35.9%

Predicting blood-brain barrier (BBB) penetration remains a central challenge in CNS drug discovery. Existing computational models rely on physicochemical descriptors and are blind to active transport biology: the efflux pumps and carrier proteins that dominate drug exclusion at the BBB in vivo. We present BBB-Nuke, a modular prediction pipeline that integrates physicochemical scoring with explicit efflux transporter substrate modeling. The system computes ten molecular descriptors, predicts ionization state via a graph convolutional network, scores CNS-MPO desirability, and estimates substrate probability for seven efflux transporters (P-gp/MDR1, BCRP/ABCG2, MRP1, MRP2, MRP4, MATE1, OAT3) using Random Forest classifiers trained on curated ChEMBL bioactivity data. A gradient-boosted classifier trained on 67 features (ten physicochemical, seven efflux transporter probabilities, and fifty fingerprint-derived principal components) achieves an area under the receiver operating characteristic curve (AUROC) of 0.933 ± 0.006 under five-fold cross-validation on 9,262 labeled compounds, and 0.810 on a fully held-out benchmark of 470 clinically validated compounds. In head-to-head comparisons, BBB-Nuke outperforms CNS-MPO, LightBBB, ADMETlab 2.0, and BBB-Score on both cross-validation and external test sets. We apply the pipeline to screen over one billion commercially available compounds from the Enamine REAL library and PubChem, identifying enriched regions of BBB-penetrant chemical space and characterizing the structural features that distinguish permeable from excluded molecules. BBB-Nuke is freely available as a Python package, REST API, and Model Context Protocol server.

2

Reference-free compound identification using computational prediction of molecular properties and multi-dimensional spectrometric measurements: a fentanyl case study

Harrilal, C. P.; Hollerbach, A. L.; Ciesielski, D.; Schultz, K. J.; Overstreet, R.; Rice, P. S.; King, E.; Nguyen, J.; Ross, D. H.; Lin, V. S.; Deng, G. Y.; Brayfindley, E.; Webb-Robertson, B.-J.; Raugei, S.; Ibrahim, Y. M.; Ewing, R. G.; Metz, T.

2026-04-27 scientific communication and education 10.64898/2026.04.22.719980 medRxiv

Top 0.1%

18.7%

Mass spectrometry is used to identify chemicals to which humans are exposed, but it cannot directly determine molecular structures. Instead, structures are inferred by matching experimental spectra to libraries of spectra constructed from analyses of pure reference compounds. However, the chemical space of human exposures far exceeds the number of experimental library spectra. Here, we evaluate a reference-free strategy for confident identification of unknown molecules. Using fentanyl as a case study, we created a suspect library of over 1 billion computationally predicted fentanyl analogs and predicted molecular properties through machine learning, molecular dynamics, and density functional theory. Multi-dimensional spectra from a blinded analysis of a mock fentanyl tablet were matched with the predicted library, yielding an average of three candidate structures per measured analog, with six exact identifications. This work emphasizes the promise of reference-free molecular measurements for assessing human exposure by merging computational predictions with high-dimensional measurements.

3

BBBP_Atlas: Unified Interpretable Modeling of Blood Brain Barrier Permeability across Small Molecules and Peptides

Shen, X.; Su, Q.; Luo, H.; Gou, Q.; Ge, J.; Hou, T.; Wang, J.; Kang, Y.

2026-07-09 bioinformatics 10.64898/2026.07.06.736742 medRxiv

Top 0.1%

13.1%

Accurate prediction of blood-brain barrier permeability (BBBP) is essential for central nervous system drug discovery, yet existing models are often limited by their reliance on predefined physicochemical descriptors, small-molecule-centered training sets, or conformation-dependent representations, which restricts their transferability across chemically diverse modalities, especially peptides. In addition, publicly available BBBP datasets remain fragmented, inconsistently standardized, and weakly controlled for molecular redundancy, increasing the risk of data leakage and overestimated model performance. In this study, we propose BBBP-Atlas, a structure-aware BBB permeability prediction model designed for unified modeling of small molecules and peptides, together with OmniBBBP, the first cross-modal dataset. Designed to bypass descriptor and conformation dependencies, our model represents standardized molecular structures as atom-level graphs to capture local atom-bond environments and long-range topological dependencies associated with BBB transport. This design enables direct learning of structure-permeability relationships from molecular topology. For model training and evaluation, we curated a cross-modal, redundancy-filtered database, OmniBBBP, that unifies small molecules and complex peptides and contains 10,218 unique compounds (9,316 small molecules and 902 peptides). BBBP-Atlas achieved an accuracy of 0.8914 and an MCC of 0.7678 on the independent test set. On a balanced external benchmark of 200 compounds, our model reached an AUC of 0.9108, an accuracy of 0.8500, and an MCC of 0.7000, outperforming LightBBB by an absolute MCC gain of 6%. Case studies further showed that BBBP-Atlas captured clinically meaningful BBB permeability patterns, correctly identifying lorlatinib as BBB-permeable and vancomycin as BBB-impermeable with high confidence.
The OmniBBBP-backed BBBP-Atlas offers a versatile and cross-modal approach for single-compound prediction, batch screening, and dataset exploration for CNS drug discovery. BBBP-Atlas is available at https://cadd.drugflow.com/bbbp/.
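
As a reading aid for the accuracy and MCC figures quoted for the balanced 200-compound benchmark, the following minimal Python sketch computes both metrics from confusion-matrix counts. The counts used here are hypothetical: the abstract does not report a confusion matrix, and these were chosen only because they are one arrangement of a balanced 100/100 set that reproduces an accuracy of 0.85 and an MCC of 0.70.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: MCC is taken as 0 when any marginal total is zero.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for a balanced set (100 permeable, 100 impermeable).
# These are NOT taken from the preprint.
tp, tn, fp, fn = 85, 85, 15, 15
example_accuracy = accuracy(tp, tn, fp, fn)
example_mcc = mcc(tp, tn, fp, fn)
```

On a balanced set with symmetric errors, MCC reduces to 2 × accuracy − 1, which is why the two reported values are mutually consistent.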

4

Deep learning models for chemical perturbation prediction do not yet utilise drug molecular features

Bai, J.; Prince, S.; Nitschke, G. S.

2026-05-15 bioinformatics 10.64898/2026.05.13.724458 medRxiv

Top 0.1%

13.0%

Recent deep learning models for L1000 chemical perturbation prediction incorporate dedicated drug molecular encoders. We retrained seven such models from scratch with zeroed or shuffled drug inputs, and compared them with a multilayer perceptron that uses only cell-line basal expression. Under drug-blind evaluation, ablation caused negligible performance changes and the drug-free baseline matched all models. Current architectures do not yet utilise drug molecular features for generalisation to unseen compounds.
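
The two input ablations named above (zeroed and shuffled drug inputs) can be illustrated with a short Python sketch. The function names and the list-of-lists feature layout are assumptions made for illustration; the preprint's own data pipeline is not described in the abstract, and the authors additionally retrained each model from scratch on the ablated inputs, which this sketch does not cover.

```python
import random


def zero_drug_features(drug_features):
    """Replace every drug feature vector with zeros of the same length."""
    return [[0.0] * len(row) for row in drug_features]


def shuffle_drug_features(drug_features, seed=0):
    """Permute drug feature vectors across samples.

    Each vector stays intact, but it is no longer paired with the
    sample (cell line and expression profile) it originally belonged to.
    """
    rng = random.Random(seed)
    shuffled = [list(row) for row in drug_features]
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical toy features: three samples, two drug features each.
features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
zeroed = zero_drug_features(features)
permuted = shuffle_drug_features(features, seed=42)
```

If a model performs the same on `zeroed` or `permuted` inputs as on the real ones, its predictions cannot be drawing on the drug representation.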

5

Stereochemistry-Aware Drug-Target Affinity Prediction

Ferreyra, S.; Dutra, I.; Galeano, A.; Paccanaro, A.

2026-05-18 bioinformatics 10.64898/2026.05.14.725200 medRxiv

Top 0.1%

13.0%

Drug-target affinity (DTA) prediction is a key task in drug discovery, enabling the estimation of the interaction strength between candidate compounds and biological targets. However, current models rely on connectivity-based molecular representations and do not explicitly account for the spatial organization of atoms, also known as stereochemistry. This limitation becomes evident when considering chirality, where a drug can exist as enantiomers, i.e., molecules that share the same atoms and bonds but differ in their three-dimensional arrangement. Despite their chemical similarity, they can interact differently with the same target, leading to variations in binding affinity and biological activity. In this paper, we propose a stereochemistry-aware DTA prediction framework that incorporates this information into molecular representations. Drug representations are learned from chemical structure using a directed-bond message-passing graph neural network that captures enantiomer configurations, while protein targets are represented through sequence-based embeddings. Experiments on the Davis dataset demonstrate that our model can improve affinity prediction. Importantly, a case study on a manually curated dataset of enantiomers with different biological action shows that the model is able to distinguish the affinities of the two forms, consistent with their experimentally observed biological activity. These findings support the relevance of stereochemistry-aware molecular representation for more accurate and chemically faithful DTA prediction.

6

PredHLM: quantitative and interpretable prediction of metabolic half-life in human liver microsomes

Jang, J.; Cho, N.-C.; Oh, K.-S.

2026-07-08 bioinformatics 10.64898/2026.07.02.736062 medRxiv

Top 0.1%

12.3%

Motivation: Human liver microsome (HLM)-based metabolic stability assays are fundamental in early drug discovery, shaping pharmacokinetic profiles and oral bioavailability. However, these experimental assays are labor-intensive and time-consuming, limiting their application in large-scale virtual screening. Computational models can prioritize compounds at scale, yet most are classification-based, leaving quantitative and interpretable prediction of HLM half-life limited. Results: In this study, we developed a quantitative machine learning model for the direct prediction of HLM half-life (T1/2) by integrating 11,790 compounds combining in-house and curated public data. Among various combinations of molecular features and learning algorithms, the XGBoost model with RDKit 2D descriptors achieved the best predictive performance, with an RMSE of 0.507 and an R² of 0.431 on an independent test set. Shapley Additive Explanations (SHAP) analysis identified lipophilicity and known metabolic soft-spot features as the primary contributors to the predictions. These results suggest that this quantitative approach provides a practical framework for defining metabolic stability margins, thereby supporting rapid Go/No-go decisions in preclinical drug discovery. Availability: The source code, data, and trained model are available at https://github.com/joshua-416/PredHLM.

7

Learning Chirality-Aware Representations to Predict Drug Side Effect Frequencies

Galeano, A.; Dutra, I.; Ferreyra, S.; Paccanaro, A.

2026-05-18 bioinformatics 10.64898/2026.05.14.725209 medRxiv

Top 0.1%

11.0%

Ab initio prediction of side effect frequencies is important for assessing the risk-benefit profile of drugs and for identifying potential adverse effects early in development. A key challenge is chirality: many drugs exist as enantiomers, pairs of molecules with the same atoms and bond connectivity but different three-dimensional arrangements. Although chemically similar, enantiomers can interact differently with biological targets and therefore exhibit distinct efficacy and adverse-effect profiles. Here we introduce F2S (Features to Signatures), a method to predict the frequencies of drug side effects while explicitly accounting for chirality. Drug representations are learned directly from chemical structure using a directed-bond message-passing graph neural network that captures stereochemical configurations. Side effect representations are derived from curated textual descriptions encoded with a frozen PubMedBERT model. Side effect frequencies are predicted from the dot product between drug and side effect signatures together with biases for drugs and side effects. We evaluated F2S extensively across multiple settings, including cold-start and warm-start prediction, prospective evaluation, and scenarios controlling for chemical similarity between training and test drugs. Across these evaluations, F2S achieves performance comparable to state-of-the-art methods for general side-effect frequency prediction while producing fewer false positives and substantially improving the prediction of frequency differences between enantiomer pairs. Finally, F2S learns compact 10-dimensional signatures that support interpretability: drug signatures reflect therapeutic class and shared targets, side-effect signatures capture phenotype similarity, and the learned bias terms correlate with the popularity of drugs and side effects.
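
The scoring rule stated in this abstract (a dot product between drug and side effect signatures plus a bias for each) can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the function name is hypothetical, the signature values are invented, and any output transformation or rounding to frequency classes that F2S may apply is not described in the abstract and is omitted here.

```python
def predict_frequency_score(drug_sig, effect_sig, drug_bias, effect_bias):
    """Score one drug/side-effect pair as dot(drug_sig, effect_sig) + biases."""
    if len(drug_sig) != len(effect_sig):
        raise ValueError("signatures must have the same dimensionality")
    dot = sum(d * e for d, e in zip(drug_sig, effect_sig))
    return dot + drug_bias + effect_bias


# Invented 10-dimensional signatures (the abstract reports 10 dimensions,
# but these particular numbers are placeholders).
drug_signature = [1.0] * 10
effect_signature = [0.5] * 10
score = predict_frequency_score(drug_signature, effect_signature,
                                drug_bias=0.25, effect_bias=-0.25)
```

Because the bias terms enter additively, a frequently reported drug or side effect raises every score it participates in, which fits the abstract's remark that the learned biases correlate with popularity.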

8

A foundation model enables prediction of natural product molecular properties, bioactivity, and structural similarity from biosynthetic gene cluster sequence

Walker, A.

2026-07-07 bioinformatics 10.64898/2026.07.05.736569 medRxiv

Top 0.1%

10.9%

Genome mining is a powerful technique in natural product discovery, where biosynthetic gene clusters that are likely to produce novel or desirable natural products are identified through bioinformatic analysis. There are many more predicted biosynthetic gene clusters than can easily be experimentally characterized. Additional computational methods to prioritize biosynthetic gene clusters by the bioactivity, structural properties, or novelty of the product would make genome mining more efficient. Multiple machine learning/artificial intelligence models have been developed to predict product properties from biosynthetic gene cluster sequence, but they are limited by small quantities of training data. Model pretraining with unlabeled data is a powerful technique to develop models that can learn on a limited amount of labeled training data. Biosynthetic gene clusters are well suited to this strategy because there are many predicted clusters with only a small percentage being characterized. This paper reports BGC-MLM, a foundation model that is pretrained with a masked language task on predicted biosynthetic gene clusters and then fine-tuned for downstream applications including prediction of product structural class, bioactivity, chemical properties, counts of functional groups, and chemical fingerprint. Comparison to a model trained without pretraining shows that pretraining generally improves performance. BGC-MLM shows better or similar performance to existing specialized methods for these tasks, demonstrating its utility as a foundation model for natural product genome mining.

9

Structure-guided compound prioritization strategy for virtual screening identifies putative binders for the nuclear receptor LRH-1

Chang-Gonzalez, A. C.; Campbell, A. N.; Bell, E. W.; Blind, R.; Meiler, J.

2026-06-07 bioinformatics 10.64898/2026.06.04.730240 medRxiv

Top 0.1%

10.0%

Compound ranking in structure-based virtual screening notoriously yields highly ranked false positive binders due to variable poses or biases in scoring terms. We developed a compound prioritization strategy that utilizes sampled docked poses from contrasting docking approaches (targeted physics-based docking and blind docking with a generative model) against multiple models of the target protein to train a multi-layer perceptron (MLP). The model predicts binders at the orthosteric ligand-binding pocket of the nuclear receptor LRH-1 (NR5A2). Our approach circumvents the reliance on a single docked pose for scoring compounds or individual scoring metrics for compound ranking. In a separate benchmarking set, we observed that the MLP identifies known binders that are chemically dissimilar from the compounds in the training set and is sensitive to single scaffold modifications, making it a potential tool for lead optimization. We applied our strategy to a prospective virtual screening campaign, which resulted in the discovery of four putative LRH-1 binders. We found that a combination of scoring and prediction metrics enriches for the hit compounds across library sizes. In all, this implementation presents a method to leverage structural and experimental data to aid virtual screening for a challenging protein target.

10

PP-MAPS: dynamic pharmacophore signatures of protein-peptide interfaces from molecular dynamics trajectories

Depenveiller, C.; Guerda, A.; Rabia, E.; Caidi, A.; Ashhab, Y.; Mami-Chouaib, F.; Montes, M.

2026-04-25 bioinformatics 10.64898/2026.04.22.720140 medRxiv

Top 0.1%

9.9%

Protein-peptide interactions underlie many cellular signaling and regulatory processes and are increasingly exploited in drug discovery. Characterizing such interfaces often requires the analysis of ensembles of conformations obtained by molecular modeling or molecular dynamics (MD) simulations, where transient contacts and alternative binding modes can be critical. Pharmacophore models provide an intuitive, transferable representation of molecular interactions. Dynophore (i.e., "dynamic pharmacophore") approaches have been developed for small-molecule ligands with MD information. We present PP-MAPS (Protein-Peptide Molecular dynamics Assisted Pharmacophore Signatures), an open-source workflow that extracts and aggregates pharmacophore interactions along MD trajectories of protein-peptide complexes. PP-MAPS produces per-residue interaction frequencies and pharmacophore heatmaps that facilitate comparison of peptides, binding sites and receptor variants. PP-MAPS is implemented in Python and is available under an open-source license at https://github.com/camilledepenveiller/PP-MAPS. The workflow relies on GROMACS for trajectory processing and can use either LigandScout or the Chemical Data Processing Toolkit (CDPKit) for pharmacophore feature detection.

11

DesignMaster: A Multi-Conditional Diffusion Framework for Rational PROTAC Design

Shi, B.; Liu, J.; Pan, T.; Hao, Y.; Isbel, L.; Roy, M. J.; Ng, A. P.; Shang, X.; Li, F.

2026-06-17 bioinformatics 10.64898/2026.06.15.732318 medRxiv

Top 0.1%

9.8%

Motivation: Proteolysis-targeting chimeras (PROTACs) enable targeted protein degradation through ternary complex formation with E3 ubiquitin ligase. However, the rational design of PROTACs remains highly challenging due to limited structure-activity relationship data and the vast conformational diversity of linkers. Existing computational approaches can be broadly divided into structure-based ternary modelling methods and fragment-based linker generation models. Although these approaches have advanced PROTAC design, they typically neglect key physicochemical constraints and linker-length control during the generation process, causing the generated PROTACs to lack the balanced structural properties required for effective ternary complex formation with drug-like characteristics. Results: To address these limitations, we propose DesignMaster, a diffusion-based generative framework that explicitly incorporates linker length and physicochemical properties as controllable conditioning signals. DesignMaster employs an E(3)-equivariant graph Transformer with a gated multi-condition fusion module to inject linker length and physicochemical constraints throughout the diffusion process, enabling fine-grained and constraint-aware molecular generation. Experiments on PROTAC-DB 2.0 and 3.0 demonstrate that DesignMaster outperforms state-of-the-art baselines, with a 3.2% improvement in validity and a 34.4% improvement in recovery. A case study shows that DesignMaster achieves a 51.78% reduction in RMSD when predicting the linker of PROTAC BCPyr targeting 6W7O, highlighting its potential for practical structure-guided PROTAC design. Availability: The source code and datasets are available at https://github.com/ABILiLab/DesignMaster.

12

A control-validated pan-proteome deep-learning pipeline nominates GPR35 as a candidate target of the orphan bacterial metabolite ligiamycin A

Martin, J.

2026-07-06 bioinformatics 10.64898/2026.07.01.735807 medRxiv

Top 0.1%

8.2%

Most microbial natural products with documented bioactivity lack an identified molecular target, which limits their development. We present an open, control-validated computational pipeline for natural-product target hypothesis generation. It combines a pan-proteome deep-learning drug-target interaction (DTI) model (a graph neural-network ligand encoder, an ESM-2 protein language-model encoder, and bidirectional cross-attention) with bias-corrected ranking and control-anchored molecular docking. Applying it to ligiamycin A, a 2022-described Streptomyces/Achromobacter co-culture decalin-amino-maleimide with no reported target, we find that the predicted interactions of the compound are dominated by class-A G-protein-coupled receptors. Using a drug with a known target (losartan) we identify and correct a frequent-hitter bias in the raw model; after correction the standout candidates are uniformly class-A GPCRs, led by the orphan receptor GPR35. Structure-based docking with matched positive and negative controls across three candidates corroborates GPR35 specifically: ligiamycin A scores comparably to the known GPR35 agonist zaprinast at the agonist pocket (-8.1 vs -8.3 kcal/mol; non-binder floor -5.5), whereas FFAR1 is excluded and histamine H2 is inconclusive. We propose GPR35 as a prioritized, experimentally testable target and release the workflow as a reusable tool. The result is a computational hypothesis that requires experimental validation.

13

PolyFold: Evaluation of Open-Use Molecular Structure Prediction Algorithms to Inform Their Utility in Diverse Biological Applications

Stephenson, H.; Voicu, D.; Novakov, V.; Levy, M.; Marsilio, J.

2026-06-16 bioengineering 10.64898/2026.06.16.732304 medRxiv

Top 0.1%

7.0%

With the growing use of machine-learning-assisted pipelines for designing, characterizing, and optimizing biomolecules, the reliability of structure prediction models is increasingly important. PolyFold is a benchmarking framework developed to evaluate open-use structure prediction models, Boltz-2 and OpenFold 3, as commercially accessible alternatives to AlphaFold 3. We outline an end-to-end workflow automation tool to streamline input file creation, batch automation, and comprehensive analysis of model outputs for leading open-use structure prediction models. We curated an evaluation dataset of several thousand high-quality Protein Data Bank structures, homology-filtering against the training sets of both models to ensure a fair analysis. We then implemented an evaluation pipeline incorporating structural metrics (RMSD, TM-score, lDDT, etc.), interface metrics (DockQ, ilDDT, iRMSD, etc.), and physicochemical realism checks (based on bond lengths, angles, molecular internal energies, etc.). We identify key performance disparities, observing that Boltz-2 is generally superior to OpenFold 3, though the differential is partially attributable to residual homology leakage not accounted for by prevailing test set curation practices. We thus recommend a new method for reducing homology when building a test set, using length-weighted average fractional identity cutoffs rather than lowest chain fractional identity cutoffs. Even after eliminating residual leakage, Boltz-2 still performs better on full-set comparisons and a variety of important partitions (nucleic acids, protein-ligands, Ab-Ags, etc.). Both models are strong at folding monomeric structures, though both struggle with homomultimer placement and small molecule physical realism, demonstrating enduring limitations of machine learning methods. This work is the first end-to-end, open-use, and reproducible platform for systematically assessing state-of-the-art structure prediction models.
PolyFold enables practitioners to determine how models compare in performance on specific inference tasks and supports the broader adoption of accessible computational tools to facilitate biomolecular science.
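
The difference between the two homology cutoffs contrasted in this abstract can be illustrated with a small Python sketch. The definitions below are assumptions based only on the names used in the abstract (a minimum over chains versus a chain-length-weighted mean); the preprint's exact formulas, the identity values, and the 0.3 cutoff are all hypothetical.

```python
def lowest_chain_identity(chains):
    """Minimum fractional identity over chains; chains = [(length, identity), ...]."""
    if not chains:
        raise ValueError("at least one chain is required")
    return min(identity for _, identity in chains)


def length_weighted_identity(chains):
    """Chain-length-weighted mean of fractional identities."""
    total_length = sum(length for length, _ in chains)
    if total_length == 0:
        raise ValueError("total chain length must be positive")
    return sum(length * identity for length, identity in chains) / total_length


# Hypothetical complex: a 300-residue chain that is 90% identical to a
# training structure, plus a 20-residue peptide that is only 10% identical.
complex_chains = [(300, 0.9), (20, 0.1)]
cutoff = 0.3  # hypothetical threshold

passes_lowest_chain = lowest_chain_identity(complex_chains) < cutoff
passes_length_weighted = length_weighted_identity(complex_chains) < cutoff
```

In this toy case the lowest-chain criterion admits the complex into the test set because of the short, dissimilar peptide, while the length-weighted criterion rejects it because most residues closely match training data. That is the kind of residual leakage the abstract describes.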

14

MolCodon: A Codon-Based Molecular Language for Interpretable Structural Representation and Similarity Search

Sayyah, E.; Kurul, E.; Tunc, H.; Durdagi, S.

2026-05-21 bioinformatics 10.64898/2026.05.20.726468 medRxiv

Top 0.1%

7.0%

Molecular representation determines which aspects of chemical structure can be learned, compared, and interpreted in computational drug discovery. Existing encodings typically emphasize either compact string description, as in SMILES and SELFIES, or efficient similarity search, as in circular fingerprints, but they may not simultaneously provide deterministic sequence structure, graph-level interpretability, pharmacophore annotation, and high-fidelity molecular reconstruction. Here, we introduce MolCodon, a codon-based molecular language that represents small molecules as deterministic sequences of fixed-width three-character tokens over a five-symbol alphabet, C, N, O, S, and X. Inspired by the triplet organization of the genetic code, MolCodon assigns chemically defined codon families to atoms, bonds, ring and branch topology, fused-ring references, pharmacophore features, bond mobility, charge, and stereochemistry. A deterministic graph traversal with ring-contiguity preservation produces sequences in which chemically meaningful substructures remain locally organized and traceable to the underlying molecular graph. Across around 2.9 million molecules from six commercial screening libraries, MolCodon achieved 98.93% InChIKey-level round-trip fidelity, supporting its use as a high-fidelity sequence representation for drug-like chemistry. MolCodon-derived sparse sequence and trace features further outperformed SELFIES and Group SELFIES across ten QSAR tasks and exceeded classical fingerprint baselines in six out of ten tasks. As an application of the representation, the MolCodon BLAST similarity engine decomposes molecular similarity into ring topology, branch context, attachment architecture, and pharmacophore correspondence, enabling interpretable scaffold-hopping searches. In a PARP1 virtual screening study, MolCodon retrieved scaffold-diverse candidates related to the known PARP-1 inhibitor olaparib.
Together, these results establish MolCodon as a new molecular representation paradigm that transforms chemical graphs into high-fidelity, interpretable, and alignment-compatible codon sequences, opening a direct path for bioinformatics-inspired analysis of small-molecule chemical space. The MolCodon encoder, decoder, and BLAST similarity engine are freely available as open-source software at https://github.com/DurdagiLab/MolCodon
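
The token format described in this abstract (fixed-width three-character tokens over the five symbols C, N, O, S, and X) implies a vocabulary of at most 5³ = 125 codons. The Python sketch below enumerates that vocabulary and splits a string into codons. It is not the MolCodon encoder: the mapping from codons to atoms, bonds, and other chemical features is not given in the abstract, and the example string is arbitrary rather than a real encoding of any molecule.

```python
from itertools import product

ALPHABET = "CNOSX"   # five-symbol alphabet named in the abstract
CODON_WIDTH = 3      # fixed token width named in the abstract

# Every possible three-character token over the alphabet.
ALL_CODONS = ["".join(symbols) for symbols in product(ALPHABET, repeat=CODON_WIDTH)]


def split_codons(sequence):
    """Split a string into fixed-width codons, validating length and symbols."""
    if len(sequence) % CODON_WIDTH != 0:
        raise ValueError("sequence length must be a multiple of the codon width")
    invalid = set(sequence) - set(ALPHABET)
    if invalid:
        raise ValueError(f"symbols outside the alphabet: {sorted(invalid)}")
    return [sequence[i:i + CODON_WIDTH] for i in range(0, len(sequence), CODON_WIDTH)]


# Arbitrary placeholder string, not a real MolCodon sequence.
tokens = split_codons("CNOXXS")
```

A fixed token width is what makes such sequences compatible with alignment-style tools: token boundaries are known from position alone, as with nucleotide triplets.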

15

Predicting P-glycoprotein Substrate Status Using a Pretrained Graph Neural Network: A TDC Benchmark Study

Yan, J.; Duan, W.

2026-06-04 bioinformatics 10.64898/2026.06.01.729343 medRxiv

Top 0.1%

6.9%

P-glycoprotein (Pgp/ABCB1) is a critical efflux transporter that significantly impacts drug bioavailability and multidrug resistance. Accurate prediction of Pgp substrate status is essential for early-stage drug discovery. In this study, we evaluate a pretrained Graph Isomorphism Network (GIN) with attribute masking on the Pgp_Broccatelli benchmark from the Therapeutics Data Commons (TDC). Our approach fine-tunes a GIN encoder pretrained on approximately 2 million molecules using a self-supervised attribute masking strategy, followed by a multilayer perceptron (MLP) classification head. On the TDC benchmark, our model achieves an AUROC of 0.937 ± 0.004 across five independent runs, ranking second on the leaderboard as of May 2026. We further compare this approach against an XGBoost baseline using Morgan fingerprints (AUROC 0.912 ± 0.007), demonstrating the advantage of graph-based molecular representations with transfer learning for small-dataset ADMET prediction tasks.

16

Attracting Cavities 3.0: Faster and More Versatile Molecular Docking for the SwissDock Webserver

Roehrig, U. F.; Mathieu-Bugnon, M.; Zoete, V.

2026-04-23 bioinformatics 10.64898/2026.04.21.719847 medRxiv

Top 0.1%

6.8%

Motivation: Molecular docking is a pillar of structure-based drug design and shows advantages in structure prediction of small-molecule ligand-protein complexes over co-folding methods for novel ligands and novel binding pockets. Here, we describe substantial improvements of our physics-based docking algorithm Attracting Cavities, which is widely used through the SwissDock webserver. Results: AC 3.0 includes enhanced sampling features, new functionalities, and technical improvements. These lead to better sampling at lower execution times and higher versatility. Comparison with AutoDock Vina demonstrates better docking results on multiple test sets. Availability: AC 3.0 will be made available free of charge through the SwissDock webserver (www.swissdock.ch).

17

MolMAE: A Surface-Centric Multimodal Masked Autoencoder for Molecular Representation Learning

Li, J.

2026-07-14 bioinformatics 10.64898/2026.07.11.737987 medRxiv

Top 0.1%

6.8%

Molecular representation learning has become a central component of modern computational drug discovery. Existing molecular foundation models mainly rely on SMILES strings, two-dimensional molecular graphs, or three-dimensional atomic coordinates. However, many molecular properties are ultimately governed by the molecular surface, where intermolecular recognition, solvation, electrostatic complementarity, and ligand-protein interactions occur. In this work, we propose MolMAE, a surface-guided multimodal masked autoencoder for molecular representation learning. MolMAE takes molecular surface point clouds, three-dimensional molecular graphs, and SMILES-derived fragment and functional-group tokens as complementary input modalities, and learns a unified multimodal molecular embedding through functional-group-aligned masked autoencoding. During pretraining, chemically corresponding local regions are jointly masked across surface, graph, fragment, and functional-group views, forcing the model to reconstruct missing geometric, physicochemical, structural, and semantic information from the remaining context. While molecular surface reconstruction serves as the primary pretraining objective, graph-, fragment-, and functional-group-level reconstruction tasks provide complementary supervision that encourages the model to capture molecular topology, bonding patterns, stereochemistry, local chemical environments, and substructure organization. In addition to reconstructing surface geometry, MolMAE reconstructs surface-associated physicochemical fields, including electrostatic potential and Fukui-related descriptors, enabling the model to learn chemically meaningful surface representations. Pretrained on approximately 261K lead-like bioactive molecules, MolMAE achieves strong performance on the ESOL benchmark under scaffold splitting and competitive performance across multiple molecular property prediction tasks. 
These results suggest that molecular surface-guided pretraining can complement conventional graph-, sequence-, and atom-coordinate-based molecular representations, especially for property prediction tasks influenced by exposed surface geometry and surface-associated physicochemical patterns.

18

ConfDock: Atom-specific Uncertainty Quantification for Molecular Docking via Conformal Prediction

Hao, H.; Elhendawy, N.; Wang, Y.; Lu, C.

2026-07-01 biochemistry 10.64898/2026.06.29.735353 medRxiv

Top 0.1%

6.8%

Molecular docking is widely used in structure-based drug discovery, yet most approaches provide point estimates without rigorous uncertainty quantification. This limitation makes it difficult to assess when a predicted pose should be trusted, especially when docking methods are applied to diverse protein-ligand systems. We present ConfDock, a conformal prediction (CP) framework for constructing atom-specific prediction intervals for ligand docking poses. ConfDock combines graph neural network (GNN) based quantile estimation with split conformal calibration, producing intervals that adapt to local protein-ligand environments while retaining distribution-free finite-sample coverage guarantees. We evaluate ConfDock on 238 protein-ligand complexes across four docking methods representing distinct computational paradigms. The proposed approach yields substantially narrower prediction intervals compared to standard split CP (57.2% average reduction in mean interval width, up to 74.5%) while maintaining target coverage across all evaluated settings. Ablation analysis indicates that the GNN captures the dominant structure-dependent variability in uncertainty, whereas the conformal calibration step provides a bounded adjustment to ensure coverage guarantees. These results demonstrate that combining learned, structure-aware quantile estimation with conformal calibration enables rigorous uncertainty quantification for molecular docking at atom-level resolution.
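
The combination named in this abstract, learned quantile estimates followed by split conformal calibration, resembles conformalized quantile regression. The Python sketch below shows that generic recipe for a scalar target, assuming the lower and upper quantile predictions are already available. It is an illustration under that assumption, not ConfDock's implementation: the function names, the scalar per-atom target, and the toy numbers are hypothetical, and the GNN that would produce the quantiles is not shown.

```python
import math


def conformal_adjustment(cal_lower, cal_upper, cal_truth, alpha=0.1):
    """Return the split-conformal correction for quantile-based intervals.

    The conformity score of a calibration point is how far the true value
    falls outside [lower, upper] (negative when it lies inside).
    """
    scores = sorted(
        max(lo - y, y - hi)
        for lo, hi, y in zip(cal_lower, cal_upper, cal_truth)
    )
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for the requested coverage level.
        return math.inf
    return scores[rank - 1]


def calibrated_interval(lower, upper, adjustment):
    """Widen (or shrink, if negative) a predicted interval by the adjustment."""
    return lower - adjustment, upper + adjustment


# Toy calibration set with four points and a target coverage of 80%.
q_hat = conformal_adjustment(
    cal_lower=[0.0, 0.0, 0.0, 0.0],
    cal_upper=[1.0, 1.0, 1.0, 1.0],
    cal_truth=[0.5, 1.2, -0.1, 1.5],
    alpha=0.2,
)
interval = calibrated_interval(2.0, 3.0, q_hat)
```

Because the learned quantiles already adapt to each local environment, the conformal step only adds one shared correction. That fits the abstract's description of calibration as a bounded adjustment on top of the structure-dependent variability captured by the GNN.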

19

Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

Guo, J.

2026-05-04 bioinformatics 10.64898/2026.04.29.721568 medRxiv

Top 0.1%

6.7%

The rapid growth of molecular foundation models and large language models has encouraged a scale-centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and graph neural networks (GNNs) trained for individual tasks. We test this assumption across 26 endpoints for molecular properties, toxicity, safety liabilities and biological activity, grouped into ADME, toxicity and bioactivity classes. The benchmark contains 78 endpoint-split entries spanning random, Murcko-scaffold and structure-separated 5-fold CV. Ordered from easiest to hardest, these splits approximate retrospective evaluation on a closed library, scaffold expansion in hit-to-lead, and library expansion on novel chemotypes. Each entry includes ML, GNN, pretrained molecular sequence and LLM-based SAR families. Across 156 fold-mean comparisons, classical ML such as RF(ECFP4) and ExtraTrees(RDKit) win 116, GNNs such as GIN and Ligandformer win 25, pretrained sequence models such as MoLFormer and ChemBERTa2 win 12, and LLM-based SAR baselines win three. ML dominates random-split interpolation but loses part of this advantage under harder splits; GNN and sequence models also decline but gain relative ground, whereas LLM-based SAR is weaker in absolute terms yet less sensitive to the split axis. Paired bootstrap analyses support family-level trends more strongly than individual model rankings. SAR knowledge derived from training folds improves many GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule-based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective for molecular property and activity prediction. Larger models add value for SAR interpretation and reasoning in low-data settings, but predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
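A paired bootstrap of the kind mentioned in this abstract can be sketched as follows. This is a generic illustration under stated assumptions, not the author's analysis code: the abstract does not say which units are resampled (endpoints, folds or compounds), so the sketch simply takes two equal-length lists of scores measured on the same units. The function name `paired_bootstrap` and its parameters are hypothetical.

```python
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Bootstrap the mean paired difference (a - b) over shared units.

    Returns a 95% percentile interval for the mean difference and the
    fraction of resamples in which model A has the higher mean score.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have equal length")
    rng = random.Random(seed)
    # Differencing first keeps the pairing intact under resampling.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample units with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(0.025 * (n_resamples - 1))]
    hi = means[int(0.975 * (n_resamples - 1))]
    frac_a_better = sum(m > 0 for m in means) / n_resamples
    return lo, hi, frac_a_better

# Usage with made-up scores for two models on the same three units.
print(paired_bootstrap([0.8, 0.7, 0.9], [0.7, 0.6, 0.8]))
```

An interval that excludes zero suggests a consistent difference between the two models across units. With few units per comparison such intervals tend to be wide, which may be one reason family-level trends are better supported than individual model rankings.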

20

Multi-level, multi-body atomic interaction graphs for machine learning-based prediction of protein-ligand binding energies

Le, T. T. H.; Nguyen, B. T.; Vo, H.; Nguyen, N. H.; Nguyen, D. D.

2026-06-07 bioinformatics 10.64898/2026.06.05.730001 medRxiv

Top 0.1%

6.7%

Accurate prediction of binding affinity is crucial for rational drug design and discovery. Traditional computational methods often rely on complex scoring functions that incorporate a multitude of physical and chemical descriptors, leading to high computational demands and sometimes limited generalizability. In this work, we propose a novel scoring function that models multi-level, multi-body atomic interactions using graph-based representations. Our method constructs comprehensive interaction graphs that incorporate both pairwise and triplet-wise atomic features, which help capture cooperative spatial patterns essential for binding affinity prediction. By employing a feature fusion strategy, GMI-Score maintains model simplicity while enhancing accuracy. Extensive evaluation across multiple datasets, including PDBbind v2013, PDBbind v2016, PDBbind v2020, CSAR-NRC-HiQ, and PDBbind-Redocked, demonstrates that our model consistently outperforms state-of-the-art scoring functions, achieving Pearson correlation coefficients up to 0.877. Furthermore, it retains strong predictive power under strict data leakage controls and realistic docking conditions, highlighting its robustness and generalizability. Scientific Contribution: In this study, we present a scoring methodology that systematically captures higher-order atomic interactions within a unified graph framework, marking a conceptual shift in cheminformatics scoring functions. Its consistent outperformance of existing methods and strong validity under redocked and withheld-data scenarios demonstrate its utility for broad-scale molecular modeling applications and open cheminformatics workflows.
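To make the distinction between pairwise and triplet-wise atomic features concrete, the sketch below extracts both from toy coordinates. It is an assumption-laden illustration, not the GMI-Score featurization: the distance cutoff, the choice of angles as the three-body feature, and the names `angle` and `interaction_features` are all hypothetical, and the abstract does not specify how the authors define these features.

```python
import math
from itertools import combinations

def angle(a, b, c):
    """Angle in degrees at vertex b formed by points a and c (distinct points)."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def interaction_features(ligand, protein, cutoff=4.0):
    """Collect two-body and three-body ligand-protein features.

    ligand, protein: lists of (element, (x, y, z)) tuples.
    cutoff: assumed contact distance threshold, in the coordinate units.
    """
    pairs, triplets = [], []
    for l_el, l_xyz in ligand:
        neighbours = []
        for p_el, p_xyz in protein:
            d = math.dist(l_xyz, p_xyz)
            if d <= cutoff:
                # Two-body feature: element pair plus distance.
                pairs.append((l_el, p_el, d))
                neighbours.append((p_el, p_xyz))
        # Three-body feature: angle at the ligand atom between each pair
        # of protein atoms in contact with it.
        for (el1, xyz1), (el2, xyz2) in combinations(neighbours, 2):
            triplets.append((l_el, el1, el2, angle(xyz1, l_xyz, xyz2)))
    return pairs, triplets

# Usage with toy coordinates: the sulfur atom lies outside the cutoff.
ligand = [("C", (0.0, 0.0, 0.0))]
protein = [("O", (1.0, 0.0, 0.0)), ("N", (0.0, 1.0, 0.0)), ("S", (10.0, 0.0, 0.0))]
pairs, triplets = interaction_features(ligand, protein)
print(len(pairs), len(triplets))
```

In a scoring function, features like these would typically be binned or aggregated per element type before being passed to a learner. That step is omitted here because the abstract gives no detail on it.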